2025-04-12
The Ames Housing dataset provides detailed information about residential properties in Ames, Iowa. Compiled by Dean De Cock as a modern alternative to the Boston Housing dataset and later popularized through a Kaggle competition, it is widely used for machine learning and data analysis practice, particularly in regression tasks.
Link to the Ames Housing data documentation
| Variable | Variable Type | Description |
|---|---|---|
| SalePrice | Numerical | Sale price of the house (target variable) in USD |
| MSSubClass | Categorical | Identifies the type of dwelling |
| MSZoning | Categorical | General zoning classification |
| LotFrontage | Numerical | Linear feet of street connected to property |
| LotArea | Numerical | Lot size in square feet |
| Street | Categorical | Type of road access |
| Alley | Categorical | Type of alley access |
| LotShape | Ordinal | General shape of property |
| LandContour | Categorical | Flatness of the property |
| Utilities | Categorical | Type of utilities available |
| LotConfig | Categorical | Lot configuration |
| LandSlope | Ordinal | Slope of property |
| Neighborhood | Categorical | Physical locations within Ames |
| Condition1 | Categorical | Proximity to main roads/railroads |
| Condition2 | Categorical | Additional proximity to main roads/railroads |
| BldgType | Categorical | Type of dwelling |
| HouseStyle | Categorical | Style of dwelling |
| OverallQual | Ordinal | Overall material and finish quality |
| OverallCond | Ordinal | Overall condition rating |
| YearBuilt | Numerical | Original construction year |
| YearRemodAdd | Numerical | Remodel year (same as construction year if no remodel) |
| RoofStyle | Categorical | Type of roof |
| RoofMatl | Categorical | Roof material |
| Exterior1st | Categorical | Primary exterior material |
| Exterior2nd | Categorical | Secondary exterior material |
| MasVnrType | Categorical | Masonry veneer type |
| MasVnrArea | Numerical | Masonry veneer area in square feet |
| Foundation | Categorical | Type of foundation |
| BsmtQual | Ordinal | Height of basement |
| BsmtCond | Ordinal | General basement condition |
| BsmtExposure | Ordinal | Walkout or garden-level basement walls |
| BsmtFinType1 | Ordinal | Quality of finished basement area |
| BsmtFinSF1 | Numerical | Type 1 finished square feet |
| BsmtFinType2 | Ordinal | Quality of second finished basement area |
| BsmtFinSF2 | Numerical | Type 2 finished square feet |
| BsmtUnfSF | Numerical | Unfinished basement square feet |
| TotalBsmtSF | Numerical | Total basement square feet |
| Heating | Categorical | Type of heating |
| HeatingQC | Ordinal | Heating quality and condition |
| CentralAir | Categorical | Central air conditioning (Y/N) |
| Electrical | Categorical | Electrical system type |
| 1stFlrSF | Numerical | First floor square footage |
| 2ndFlrSF | Numerical | Second floor square footage |
| LowQualFinSF | Numerical | Low-quality finished square feet |
| GrLivArea | Numerical | Above ground living area square feet |
| BsmtFullBath | Numerical | Number of full bathrooms in basement |
| BsmtHalfBath | Numerical | Number of half bathrooms in basement |
| FullBath | Numerical | Number of full bathrooms above grade |
| HalfBath | Numerical | Number of half bathrooms above grade |
| BedroomAbvGr | Numerical | Number of bedrooms above basement level |
| KitchenAbvGr | Numerical | Number of kitchens |
| KitchenQual | Ordinal | Kitchen quality |
| TotRmsAbvGrd | Numerical | Total number of rooms above ground (excluding bathrooms) |
| Functional | Ordinal | Home functionality rating |
| Fireplaces | Numerical | Number of fireplaces |
| FireplaceQu | Ordinal | Fireplace quality |
| GarageType | Categorical | Location of garage |
| GarageYrBlt | Numerical | Year garage was built |
| GarageFinish | Ordinal | Interior finish of garage |
| GarageCars | Numerical | Size of garage in car capacity |
| GarageArea | Numerical | Size of garage in square feet |
| GarageQual | Ordinal | Garage quality |
| GarageCond | Ordinal | Garage condition |
| PavedDrive | Ordinal | Paved driveway (Y/N) |
| WoodDeckSF | Numerical | Wood deck area in square feet |
| OpenPorchSF | Numerical | Open porch area in square feet |
| EnclosedPorch | Numerical | Enclosed porch area in square feet |
| 3SsnPorch | Numerical | Three-season porch area in square feet |
| ScreenPorch | Numerical | Screen porch area in square feet |
| PoolArea | Numerical | Pool area in square feet |
| PoolQC | Ordinal | Pool quality |
| Fence | Ordinal | Fence quality |
| MiscFeature | Categorical | Miscellaneous feature (Shed, Tennis Court, etc.) |
| MiscVal | Numerical | Value of miscellaneous feature |
| MoSold | Numerical | Month sold |
| YrSold | Numerical | Year sold |
| SaleType | Categorical | Type of sale |
| SaleCondition | Categorical | Sale condition |
# Load the Ames Housing data and report its dimensions
ames_data <- read.csv("AmesHousing.csv")
cat("Number of observations:", nrow(ames_data), "\n")
cat("Number of features:", ncol(ames_data), "\n")

Number of observations: 2930
Number of features: 82
Null and missing values
missing_data <- colSums(is.na(ames_data))
cat("Missing values per feature:\n")
print(missing_data[missing_data > 0])

Missing values per feature:
Lot.Frontage Alley Mas.Vnr.Area Bsmt.Qual Bsmt.Cond
490 2732 23 79 79
Bsmt.Exposure BsmtFin.Type.1 BsmtFin.SF.1 BsmtFin.Type.2 BsmtFin.SF.2
79 79 1 79 1
Bsmt.Unf.SF Total.Bsmt.SF Bsmt.Full.Bath Bsmt.Half.Bath Fireplace.Qu
1 1 2 2 1422
Garage.Type Garage.Yr.Blt Garage.Finish Garage.Cars Garage.Area
157 159 157 1 1
Garage.Qual Garage.Cond Pool.QC Fence Misc.Feature
158 158 2917 2358 2824
Rather than dropping rows, we impute the missing values: for most of these columns an NA means the feature is absent, so categorical NAs become "None" and the related numerical NAs become 0, while truly missing measurements get the column median. (read.csv converts the spaces in the raw column names to dots, hence Lot.Frontage, Pool.QC, etc.)
# Categorical variables
ames_data$Alley[is.na(ames_data$Alley)] <- "None"
ames_data$Pool.QC[is.na(ames_data$Pool.QC)] <- "None"
ames_data$Fence[is.na(ames_data$Fence)] <- "None"
ames_data$Misc.Feature[is.na(ames_data$Misc.Feature)] <- "None"
# Numerical variables (use median for continuous variables)
ames_data$Mas.Vnr.Area[is.na(ames_data$Mas.Vnr.Area)] <- median(ames_data$Mas.Vnr.Area, na.rm = TRUE)
# Impute missing Lot.Frontage with the median value
ames_data$Lot.Frontage[is.na(ames_data$Lot.Frontage)] <- median(ames_data$Lot.Frontage, na.rm = TRUE)
# basement-related values (assume 'None' for categorical and 0 for numerical)
ames_data$Bsmt.Qual[is.na(ames_data$Bsmt.Qual)] <- "None"
ames_data$Bsmt.Cond[is.na(ames_data$Bsmt.Cond)] <- "None"
ames_data$Bsmt.Exposure[is.na(ames_data$Bsmt.Exposure)] <- "None"
ames_data$BsmtFin.Type.1[is.na(ames_data$BsmtFin.Type.1)] <- "None"
ames_data$BsmtFin.SF.1[is.na(ames_data$BsmtFin.SF.1)] <- 0
ames_data$BsmtFin.Type.2[is.na(ames_data$BsmtFin.Type.2)] <- "None"
ames_data$BsmtFin.SF.2[is.na(ames_data$BsmtFin.SF.2)] <- 0
ames_data$Bsmt.Unf.SF[is.na(ames_data$Bsmt.Unf.SF)] <- 0
ames_data$Total.Bsmt.SF[is.na(ames_data$Total.Bsmt.SF)] <- 0
ames_data$Bsmt.Full.Bath[is.na(ames_data$Bsmt.Full.Bath)] <- 0
ames_data$Bsmt.Half.Bath[is.na(ames_data$Bsmt.Half.Bath)] <- 0
# Garage-related features (assume 'None' for categorical and 0 for numerical)
ames_data$Garage.Type[is.na(ames_data$Garage.Type)] <- "None"
ames_data$Garage.Yr.Blt[is.na(ames_data$Garage.Yr.Blt)] <- 0
ames_data$Garage.Finish[is.na(ames_data$Garage.Finish)] <- "None"
ames_data$Garage.Qual[is.na(ames_data$Garage.Qual)] <- "None"
ames_data$Garage.Cond[is.na(ames_data$Garage.Cond)] <- "None"
ames_data$Garage.Cars[is.na(ames_data$Garage.Cars)] <- 0
ames_data$Garage.Area[is.na(ames_data$Garage.Area)] <- 0
# Fireplaces
ames_data$Fireplace.Qu[is.na(ames_data$Fireplace.Qu)] <- "None"

The dataset has been streamlined to the target variable plus ten engineered features that are most relevant for predicting the sale price of a property. This refined selection mixes numerical values with binary indicators, capturing living space, property age, garage capacity, and available amenities. The chosen columns are:
- SalePrice: the target variable, the final sale price of the property.
- TotalBathrooms: a combined count of all full and half bathrooms, with half bathrooms weighted 0.5.
- TotalSquareFootage: above-ground living area plus finished basement and first- and second-floor areas.
- HouseAge: the age of the house at the time it was sold.
- TotalGarageSize: a combined measure of garage square footage and car capacity.
- TotalPorchArea: the sum of all deck and porch spaces, including open, enclosed, and screened porches.
- HasPool: a binary indicator for whether the property includes a pool.
- HasFence: a binary indicator for the presence of a fence.
- HasFireplace: a binary indicator for whether the house has one or more fireplaces.
- HasMiscFeature: a binary indicator for additional features such as sheds or tennis courts.
- OverallQualityScore: a composite score combining the overall quality and condition ratings.
These features were carefully chosen due to their strong influence on property value and were derived by combining or transforming relevant variables from the original dataset.
# Combine full and half bathrooms (half bathrooms count as 0.5)
ames_data$TotalBathrooms <- ames_data$Full.Bath + (ames_data$Half.Bath * 0.5) + ames_data$Bsmt.Full.Bath + (ames_data$Bsmt.Half.Bath * 0.5)
# Total square footage; note Gr.Liv.Area already includes the first- and
# second-floor areas, so above-grade space is weighted more heavily here
ames_data$TotalSquareFootage <- ames_data$Gr.Liv.Area + ames_data$BsmtFin.SF.1 + ames_data$BsmtFin.SF.2 + ames_data$X1st.Flr.SF + ames_data$X2nd.Flr.SF
# Calculate House Age
ames_data$HouseAge <- ames_data$Yr.Sold - ames_data$Year.Built
# Calculate Garage Age; Garage.Yr.Blt NAs were imputed to 0 above, so test for
# a positive year rather than NA (otherwise no-garage homes get a huge age)
ames_data$GarageAge <- ifelse(ames_data$Garage.Yr.Blt > 0, ames_data$Yr.Sold - ames_data$Garage.Yr.Blt, NA)
# Calculate Remodel Age (if Year.Remod.Add exists)
ames_data$RemodelAge <- ifelse(!is.na(ames_data$Year.Remod.Add), ames_data$Yr.Sold - ames_data$Year.Remod.Add, NA)
# Combine all three into a new feature for total property age considering remodel/garage build
ames_data$TotalPropertyAge <- pmin(ames_data$HouseAge, ames_data$GarageAge, ames_data$RemodelAge, na.rm = TRUE)
# Combine garage area with car capacity, weighting each car at roughly 200 sq ft
ames_data$TotalGarageSize <- ames_data$Garage.Area + (ames_data$Garage.Cars * 200)
ames_data$TotalPorchArea <- ames_data$Wood.Deck.SF + ames_data$Open.Porch.SF + ames_data$Enclosed.Porch + ames_data$X3Ssn.Porch + ames_data$Screen.Porch
ames_data$HasPool <- ifelse(ames_data$Pool.Area > 0, 1, 0)
# Fence and Misc.Feature NAs were recoded to "None" above, so test against
# "None" rather than NA (an is.na() check would always be FALSE here)
ames_data$HasFence <- ifelse(ames_data$Fence != "None", 1, 0)
ames_data$HasFireplace <- ifelse(ames_data$Fireplaces > 0, 1, 0)
ames_data$HasMiscFeature <- ifelse(ames_data$Misc.Feature != "None", 1, 0)
ames_data$OverallQualityScore <- ames_data$Overall.Qual * ames_data$Overall.Cond

Select the required columns for the new dataset and check the first few rows.
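The selection code itself did not survive in the write-up. A minimal sketch of what it likely looked like, assuming the engineered columns created above (the name Selected_Ames_Data is taken from the train/test split step below):

# Keep only the target and the ten engineered features
selected_cols <- c("SalePrice", "TotalBathrooms", "TotalSquareFootage", "HouseAge",
                   "TotalGarageSize", "TotalPorchArea", "HasPool", "HasFence",
                   "HasFireplace", "HasMiscFeature", "OverallQualityScore")
Selected_Ames_Data <- ames_data[, selected_cols]
head(Selected_Ames_Data)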
SalePrice TotalBathrooms TotalSquareFootage HouseAge TotalGarageSize
1 215000 2.0 3951 50 928
2 105000 1.0 2404 49 930
3 172000 1.5 3581 52 512
4 244000 3.5 5285 42 922
5 189900 2.5 4049 13 882
6 195500 2.5 3810 12 870
TotalPorchArea HasPool HasFence HasFireplace HasMiscFeature
1 272 0 1 1 1
2 260 0 1 0 1
3 429 0 1 0 1
4 0 0 1 1 1
5 246 0 1 1 1
6 396 0 1 1 1
OverallQualityScore
1 30
2 30
3 36
4 35
5 25
6 36
Split the dataset into 80% training and 20% testing so the model is evaluated on data it has never seen, giving an honest estimate of out-of-sample performance.
library(caret)  # provides createDataPartition and preProcess
set.seed(123)
trainIndex <- createDataPartition(Selected_Ames_Data$SalePrice, p = 0.8, list = FALSE)
train <- Selected_Ames_Data[trainIndex, ]
test <- Selected_Ames_Data[-trainIndex, ]
cat("Data split complete:\n")
cat("Training set size:", nrow(train), "\n")
cat("Test set size:", nrow(test), "\n")Data split complete:
Training set size: 2346
Test set size: 584
Standardize the numerical features so that all variables are on the same scale (mean = 0, sd = 1). This matters for regularization techniques like LASSO, whose penalty treats every coefficient equally.
# Standardize numerical features (column 1, SalePrice, is left on its original scale)
preProc <- preProcess(train[, -1], method = c("center", "scale"))
train_scaled <- predict(preProc, train)
test_scaled <- predict(preProc, test)
cat("Scaling and centering complete.\n")scaling and centering complete.
Next steps:
- Convert the data to matrices for modeling.
- Train a LASSO model with cross-validation to find the best lambda.
- Fit the final model using the selected lambda.
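For reference, LASSO (glmnet with alpha = 1) minimizes a penalized least-squares objective, where lambda controls the strength of the L1 penalty that shrinks coefficients toward zero:

$$\min_{\beta_0,\,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^\top \beta\right)^2 \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert$$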
# Build model matrices; drop the intercept column added by model.matrix
x_train <- model.matrix(SalePrice ~ ., train_scaled)[, -1]
y_train <- train_scaled$SalePrice
x_test <- model.matrix(SalePrice ~ ., test_scaled)[, -1]
y_test <- test_scaled$SalePrice
library(glmnet)  # provides glmnet and cv.glmnet
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
best_lambda <- cv_lasso$lambda.min
lasso_model <- glmnet(x_train, y_train, alpha = 1, lambda = best_lambda)

# Predict on test data
y_pred <- predict(lasso_model, s = best_lambda, newx = x_test)
# Compute MSE and R-squared
mse <- mean((y_test - y_pred)^2)
r_squared <- 1 - sum((y_test - y_pred)^2) / sum((y_test - mean(y_test))^2)
cat("MSE:", mse, "\nR-squared:", r_squared)MSE: 1282393660
R-squared: 0.7947703
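The R-squared computed above is the usual coefficient of determination, the fraction of variance in the test-set prices explained by the model:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$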
# Extract feature importance
lasso_coeffs <- coef(lasso_model, s = best_lambda)
feature_importance <- data.frame(Feature = rownames(lasso_coeffs), Coefficient = as.vector(lasso_coeffs))
feature_importance <- feature_importance[order(abs(feature_importance$Coefficient), decreasing = TRUE), ]
# Display top 5 important features
head(feature_importance, 5)

              Feature Coefficient
1 (Intercept) 180581.73
3 TotalSquareFootage 35081.10
4 HouseAge -19539.64
11 OverallQualityScore 18144.33
5 TotalGarageSize 13584.84
# Perform 5-fold cross-validation to evaluate the model
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 5)
# Best lambda value
best_lambda_cv <- cv_lasso$lambda.min
cat("Best Lambda from Cross-Validation: ", best_lambda_cv, "\n")
# Predict on test data using the best lambda
lasso_cv_model <- glmnet(x_train, y_train, alpha = 1, lambda = best_lambda_cv)
y_pred_cv <- predict(lasso_cv_model, s = best_lambda_cv, newx = x_test)
# Calculate MSE and R-squared for cross-validation
mse_cv <- mean((y_test - y_pred_cv)^2)
r_squared_cv <- 1 - sum((y_test - y_pred_cv)^2) / sum((y_test - mean(y_test))^2)
cat("MSE from Cross-Validation: ", mse_cv, "\n")
cat("R-squared from Cross-Validation: ", r_squared_cv, "\n")Best Lambda from Cross-Validation: 2566.186
MSE from Cross-Validation: 1285840101
R-squared from Cross-Validation: 0.7942187
The histogram shows the distribution of residuals (prediction errors) after removing any non-finite values. The residuals are roughly centered around zero, indicating a reasonable fit with no obvious systematic bias.
library(ggplot2)  # plotting
library(scales)   # axis labels without scientific notation
# Calculate residuals and drop any non-finite values
residuals_cv_clean <- na.omit(y_test - y_pred_cv)
# Plot histogram of cleaned residuals without scientific notation
ggplot(data.frame(residuals_cv_clean), aes(x = residuals_cv_clean)) +
geom_histogram(binwidth = 50000, fill = "lightblue", color = "black") +
scale_x_continuous(labels = scales::label_number()) + # Remove scientific notation on x-axis
labs(title = "Residuals Distribution (Cleaned)", x = "Residuals", y = "Frequency") +
theme_minimal()

The cross-validation curve shows the relationship between the cross-validation error and lambda (the regularization parameter). The optimal value of lambda is chosen at the minimum error (lambda.min), balancing model performance and complexity.
# Plot cross-validation curve
plot(cv_lasso)
title("Cross-validation Error vs Log(Lambda)", line = 2.5)The coefficient path plot illustrates how feature coefficients change as lambda increases. As regularization strengthens, less important features’ coefficients shrink to zero, indicating which variables are most influential in predicting house prices.
lasso_full <- glmnet(x_train, y_train, alpha = 1)
# Plot coefficient paths
plot(lasso_full, xvar = "lambda", label = TRUE)
title("LASSO Coefficient Path")The bar plot shows the non-zero coefficients retained by the LASSO model after regularization. These are the most influential features in predicting house prices, with higher coefficients indicating greater importance.
# Get non-zero coefficients
coef_lasso <- coef(lasso_model)
non_zero_coef <- coef_lasso[coef_lasso[, 1] != 0, , drop = FALSE]
# Create a tidy dataframe from non-zero coefficients
non_zero_df <- as.data.frame(as.matrix(non_zero_coef))
non_zero_df$Feature <- rownames(non_zero_df)
colnames(non_zero_df)[1] <- "Coefficient"
non_zero_df <- non_zero_df[non_zero_df$Feature != "(Intercept)", ] # Remove intercept if present
print(non_zero_coef)
# Plot
ggplot(non_zero_df, aes(x = reorder(Feature, Coefficient), y = Coefficient)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Selected Features by LASSO", x = "Feature", y = "Coefficient") +
theme_minimal()

- Efficient Feature Selection: LASSO identified the most influential predictors of house prices by shrinking less important coefficients to zero.
- Improved Model Simplicity and Interpretability: the final model is not only accurate but also easy to understand, which is valuable for analysts and stakeholders alike.
- Reduced Overfitting: by penalizing model complexity, LASSO improves generalization to new, unseen data.
Key Takeaways:
- TotalSquareFootage, OverallQualityScore, and TotalGarageSize emerged as strong predictors.
- Binary indicators like HasPool and HasFireplace added meaningful context.
- LASSO is well suited to settings with many features and potential multicollinearity.
Future Work:
- Compare with Ridge and Elastic Net for performance (a sketch follows this list).
- Apply repeated or nested cross-validation for more robust tuning.
- Explore interactions and nonlinear relationships.
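A minimal sketch of the Ridge and Elastic Net comparison, reusing the matrices built above. In glmnet, alpha = 0 gives Ridge and intermediate alpha values give Elastic Net; the 0.5 used here is an arbitrary starting point, and alpha itself could be tuned:

# Ridge regression: alpha = 0 keeps all features, shrinking coefficients toward zero
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)
pred_ridge <- predict(cv_ridge, s = cv_ridge$lambda.min, newx = x_test)

# Elastic Net: alpha = 0.5 blends the L1 and L2 penalties
cv_enet <- cv.glmnet(x_train, y_train, alpha = 0.5)
pred_enet <- predict(cv_enet, s = cv_enet$lambda.min, newx = x_test)

# Compare test-set MSE across the three penalties
cat("Ridge MSE:      ", mean((y_test - pred_ridge)^2), "\n")
cat("Elastic Net MSE:", mean((y_test - pred_enet)^2), "\n")
cat("LASSO MSE:      ", mse_cv, "\n")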